1 Introduction
1.1 Overview and Motivation
1.1.1 Context and Background
The Swiss real estate market is known for its resilience and complexity, making it an ideal candidate for advanced analytical approaches to understand pricing dynamics. This project, part of a Master’s degree course in Machine Learning at the University of Lausanne, aims to leverage data science to predict real estate market prices in Switzerland. This project provides practical and up-to-date insights into real estate valuation.
With housing prices fluctuating due to factors like interest rate changes and demographic shifts (Credit Suisse), this study could be valuable for investors, policymakers, and academics alike.
1.1.2 Aim Of The Investigation
The main goal of this study is to predict Swiss real estate prices using real-time data from ImmoScout24, one of the largest Swiss real estate platforms. Specifically, the study aims to answer:
How accurately can machine learning models predict the market prices of real estate properties in Switzerland based on current market data?
1.1.3 Description of the Data
The main data set for this study comes from ImmoScout24 and includes variables like price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points were collected through a sophisticated scraping algorithm, providing a detailed snapshot of the current market. This comprehensive data set with a granular view of the market is essential for training effective machine learning models.
Additional data sets come from the Swiss Federal Statistical Office and are used to characterize the political, demographic, and fiscal landscape surrounding the real estate market.
1.1.4 Methodology
The project uses both supervised and unsupervised machine learning techniques to quantify the impact of various factors on property prices in Switzerland. Unsupervised methods reduce the dimensionality of the external factors; the resulting features are then combined with our main data set to train predictive models. This lets us analyze both linear dependencies and more intricate interactions, such as how location and property type jointly affect pricing.
1.1.5 Structure of the Report
The report is structured to provide a clear and logical analysis:
Section 1: Introduction - Outlines the research context, objectives, and significance.
Section 2: Data - Details the sources, nature, and preprocessing of the data used.
Section 3: Unsupervised Learning - Applies clustering techniques to understand the political, demographic and fiscal landscape.
Section 4: Exploratory Data Analysis (EDA) - Analyzes the data to explore patterns and anomalies.
Section 5: Supervised Learning - Discusses the development and validation of predictive models.
Section 6: Conclusion - Summarizes the findings, discusses the implications, and suggests areas for further research.
2 Data
2.1 Properties Dataset
The dataset is stored in CSV format, comprising real estate listings scraped from the ImmoScout24 platform. It features a variety of fields related to property details.

Here is the raw dataset:

```r
properties <- read.csv(file.path(here(), "data/properties.csv"))

# Show the first 2000 rows of the raw dataset in an interactive table
reactable(
  head(properties, 2000),
  bordered = TRUE,
  striped = TRUE,
  highlight = TRUE,
  defaultPageSize = 5,
  showPageSizeOptions = TRUE,
  showPagination = TRUE,
  showSortable = TRUE
)
```
The instances in the dataset represent individual property listings. Each row corresponds to a unique listing on ImmoScout24.
We have 22,136 observations in total, distributed across the different cantons of Switzerland.
2.1.2 Wrangling and Cleaning
The data cleaning process for our dataset involves several careful steps to prepare the data for accurate analysis and modeling. Initially, we identify and address problematic values in the number_of_rooms field, where non-numeric entries are detected and marked for further cleaning. The price field is then sanitized to remove any non-numeric characters, ensuring all data in this column represent valid numerical values.
We exclude properties priced under 20,000 CHF because such low prices are often due to data entry mistakes or listings that do not reflect real market values, which can distort our analysis.
Further cleaning includes standardizing addresses by stripping unwanted characters at the beginning, enhancing uniformity across this variable. We remove rows with any NA values to keep only complete cases. For the square_meters field, we strip non-digit characters, convert the cleaned string to numeric, and discard any entries that become NA in this transformation, since incomplete data can lead to errors and unreliable results.
Categorical data in year_category is truncated for consistency and converted into a factor for potential use in modeling. In the number_of_rooms field, non-relevant text (like “rooms” or “room”) is removed, and we discard erroneous data such as entries mistakenly including “m²” or room counts exceeding 100, which are likely due to data entry oversights. If the number of rooms exceeds 27, we divide the value by ten, assuming these were misreported due to decimal placement errors.
Lastly, we enhance the readability and standardization of the canton field by capitalizing each entry, which is important for categorical consistency across the dataset. These steps ensure the dataset’s integrity and readiness for analytical processes, focusing on maintaining a robust, clean, and usable data structure.
```r
# Identify the values causing the issue in number_of_rooms
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion

# Remove non-numeric characters from price and convert to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Exclude rows with price < 20000
properties <- properties[properties$price >= 20000, ]

# Strip incomplete address prefixes
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

# Keep complete cases only
properties_filtered <- na.omit(properties)

# Truncate year_category and convert it to a factor
properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
properties_filtered$year_category <- as.factor(properties_filtered$year_category)

# Remove "m²" and other non-digits from square_meters
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))

# Print how many NA observations are left in square_meters, then drop them
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters), ]

# Capitalize canton names
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# Clean number_of_rooms: drop "rooms"/"room" text and rows containing "m²"
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)

# Remove rows with more than 100 rooms
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]

# Divide by 10 when the number of rooms exceeds 27 (assumed decimal-placement errors)
properties_filtered$number_of_rooms <- ifelse(
  properties_filtered$number_of_rooms > 27,
  properties_filtered$number_of_rooms / 10,
  properties_filtered$number_of_rooms
)

# Remove a row with an implausible number of rooms
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms != 7.6, ]
```
Here is the cleaned dataset:

```r
# Show the first 2000 rows of the cleaned dataset in an interactive table
reactable(
  head(properties_filtered, 2000),
  bordered = TRUE,
  striped = TRUE,
  highlight = TRUE,
  defaultPageSize = 5,
  showPageSizeOptions = TRUE,
  showPagination = TRUE,
  showSortable = TRUE
)
```
2.2 Localities Dataset
The “Official Index of Localities” (Répertoire officiel des localités) is provided by the Swiss Federal Office of Topography (swisstopo). This dataset includes detailed information about all localities in Switzerland and Liechtenstein, such as names, postal codes, and boundaries.
Updated monthly with input from cantonal authorities and Swiss Post, this data set is useful for spatial analysis, integration with other geographic data sets, and use in GIS and CAD systems. It’s also valuable for statistical analysis and as a reference for information systems.
Periodic updates and release notes detail the changes and improvements made. Swisstopo manages and distributes this data set, fulfilling its role in providing official geospatial data for Switzerland.
2.2.1.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95
Here we create a new variable zip_code by extracting the zip code from the address column of the properties_filtered data set. We strip all non-digit characters from the address and, when the resulting number has more than 4 digits, keep only the last four. We then merge the zip_code variable with the AMTOVZ_CSV_LV95 data set to obtain the corresponding community and canton information for each zip code.
```r
df <- properties_filtered

# The address column looks like "1844 Villeneuve VD" and contains the zip code.
# Extract the digits and create a new column 'zip_code'
# (zip codes are identified as 4-digit numbers)
df$zip_code <- as.numeric(gsub("\\D", "", df$address))

# Keep only the last four digits when the extracted number is longer
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)
```
2.2.1.2 Using AMTOVZ_CSV_LV95 to get the city and canton from the zip code
Next, we use the AMTOVZ_CSV_LV95 data set to retrieve additional information, such as the community and canton, based on the extracted zip codes. This involves the following steps:
```r
# Read the official locality file (WGS84 coordinates)
amto <- read.csv(file.path(here(), "data/AMTOVZ_CSV_WGS84.csv"), sep = ";")

# Keep the community name, zip code, canton code, and coordinates
amto_df <- amto[, c("Gemeindename", "PLZ", "Kantonskürzel", "E", "N")]
colnames(amto_df) <- c("Community", "zip_code", "Canton_code", "lon", "lat")

# Remove duplicates of zip code
amto_df <- amto_df[!duplicated(amto_df$zip_code), ]

# Add the locality variables to df where the zip code matches (left join)
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

# Check for listings that could not be matched to a community
df[is.na(df$Community), ]
#> (77 unmatched rows printed here, with zip codes such as 25, 26, 322, 1200,
#> 2500, 3000, 4000, 6602, 8000 and 9241; Community, Canton_code, lon and lat
#> are all NA for these rows)
```
Upon merging, we identified 77 rows where the Community was NA. This can happen for two reasons:
The zip code extracted from the address did not exist in the amto_df data set.
The zip code was incorrectly isolated from the address.
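The second failure mode is easy to reproduce. Because the extraction strips all non-digit characters, any extra digits in the address (house numbers, trailing district numbers) get concatenated with the real zip code before the modulo is applied. A minimal sketch, using the Laax address that appears among the unmatched rows:

```r
# An address whose house number ("30d") and trailing "2" get concatenated
# with the true zip code (7032) when non-digits are stripped
addr <- "Via Cuolm Liung 30d, 7032 Laax GR 2"

digits <- gsub("\\D", "", addr)     # "3070322": 30, 7032 and 2 run together
zip <- as.numeric(digits) %% 10000
zip                                  # 322 -- not the true zip code 7032
```

This explains unmatched zip codes such as 322: they are artifacts of digit concatenation rather than real postal codes.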
To preserve the integrity of our data, we decided to remove these rows with missing Community values:
```r
# Remove the rows with NA in Community
properties_filtered <- df[!is.na(df$Community), ]
reactable(head(properties_filtered, 100))
```
2.3 Tax data
The Swiss Tax dataset encompasses comprehensive information on income, wealth, profits, and capital taxes for each canton and commune in Switzerland. It provides detailed data on tax rates and regulations, allowing for in-depth analysis and comparison across regions.
This dataset is sourced from various official authorities, including cantonal tax authorities and the Swiss Federal Tax Administration. It is regularly updated to reflect changes in tax laws, rates, and administrative details at both the cantonal and communal levels. This dataset may help reveal whether certain taxes are linked to property prices.
Updates and revisions to the dataset are provided periodically, ensuring its accuracy and relevance. The Swiss Federal Tax Administration oversees the distribution and management of this dataset, supporting its role in providing reliable and comprehensive tax information for Switzerland.
The dataset is available from the Swiss Federal Tax Administration; we use the year 2024 and the corresponding taxes.
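To make the intended use concrete, the sketch below shows the shape of a tax-versus-price check once the tax table is merged with the listings. The data frame, column names, and values here are invented purely for illustration; they are not the actual columns of the cleaned tax table:

```r
# Hypothetical mini-example: median listing price vs. a cantonal tax coefficient
# (all values invented for illustration only)
toy <- data.frame(
  canton       = c("Zug", "Vaud", "Jura"),
  tax_coeff    = c(52, 155, 189),              # hypothetical coefficients
  median_price = c(1850000, 1200000, 520000)   # hypothetical median prices
)

# A negative correlation would be consistent with low-tax cantons
# commanding higher property prices
cor(toy$tax_coeff, toy$median_price)
```

Section 5 evaluates this kind of relationship properly within the supervised models; this is only the intuition behind including the tax data.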
```r
# Read the income tax rates file
impots <- read.csv(file.path(here(), "data/estv_income_rates.csv"),
                   sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove 1st row and 3rd column
impots <- impots[-1, ]
impots <- impots[, -3]

# Label the column groups
impots[1, 4:8]   <- "Impôt sur le revenu"
impots[1, 9:13]  <- "Impôt sur la fortune"
impots[1, 14:16] <- "Impôt sur le bénéfice"
impots[1, 17:19] <- "Impôt sur le capital"

# Combine the content of the first two rows into the 2nd row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) {
  paste(ifelse(is.na(x[1]), x[2],
        ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " "))))
})
impots <- impots[-1, ]
impots[1, 1] <- "Coefficient d'impôt en %"

# Use the first row as column names
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values in impots
any_missing <- any(is.na(impots))
if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."

# Use the 3rd column as row names, then drop it
row.names(impots) <- impots[, 3]
impots <- impots[, -3]

# Remove the canton column and the church tax columns
impots <- impots[, -2]
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]

# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))

# Replace NA values with 0
cleaned_impots[is.na(cleaned_impots)] <- 0

rownames(cleaned_impots) <- rownames(impots)
```
2.4 Commune Data
2.4.1 Wrangling and Cleaning
The Regional Portrait 2021 (Portraits régionaux 2021: chiffres-clés de toutes les communes) is the most recent data set published by the Swiss Federal Statistical Office, giving key figures for all municipalities in Switzerland. We incorporate it in our analysis to capture the external environment that may impact real estate prices. The completeness and precision of the data make the cleaning task easier. The main steps are as follows:
We select only the most recent year with complete data (2019).
We remove values flagged as missing, whether marked as "M = Not indicated because data was not important or applicable" or as "Q = Not indicated to protect confidentiality".
The data set comprises both detailed data and aggregates; we remove the aggregate rows (such as the Switzerland-wide totals) to retain the municipality-level information.
We make the decision to replace the 480 missing values for the “Taux de couverture sociale” variable by the Swiss mean for 2019 (3.2%). These values were marked as missing for reason “Q = Not indicated to protect confidentiality”. We are aware that this decision introduces some bias. For example, it could be hypothesized that the Federal Statistic Office had decided to remove the value from public registers above a certain percentage to protect confidentiality and avoid prejudice, therefore distorting our analysis. Nonetheless, we value the information contained in this variable, and therefore, keep it in our model.
```r
# TODO: the path may need to be adjusted
commune_prep <- read.csv(file.path(here(), "data/commune_data.csv"),
                         sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Keep only 2019 as the reference year (2020 is apparently not complete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# Delete the rows where STATUS is Q or M (no value), then drop the column
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# Remove the rows that are aggregates
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided map
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# If no one voted for a party the value is NA -> replace it with 0 instead
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))

# Replace NAs in the social-aid rate column with the Swiss mean for 2019 (3.2%)
mean_taux_aide_social <- 3.2
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(
    is.na(`Sécurité sociale - Taux d'aide sociale`),
    mean_taux_aide_social,
    `Sécurité sociale - Taux d'aide sociale`))

# Show the first 100 rows of commune in an interactive table
reactable(head(commune, 100))
```
3 Unsupervised learning
To explore the relationship between real estate prices and external factors, we apply unsupervised clustering to the fiscal, demographic, and political data sets for each Swiss municipality. The resulting clusters are then included as features in our supervised models, since the municipalities within a cluster behave roughly the same in these respects.
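The overall pipeline can be sketched on synthetic data: cluster the municipalities on their external indicators, then feed the cluster label into a predictive model as a categorical feature. Everything below (the data, the column names, and the linear model) is illustrative only, not our actual pipeline:

```r
set.seed(42)

# Synthetic external indicators for 100 hypothetical municipalities
ext <- data.frame(tax = rnorm(100), density = rnorm(100), foreigners = rnorm(100))

# Step 1: cluster the municipalities on the scaled indicators
km <- kmeans(scale(ext), centers = 3)

# Step 2: attach the cluster label to the listings as a categorical feature
listings <- data.frame(
  price         = rlnorm(100, meanlog = 13.5),  # synthetic listing prices
  square_meters = runif(100, 40, 250),
  cluster       = factor(km$cluster)
)

# Step 3: the cluster label enters a supervised model as a factor
model <- lm(price ~ square_meters + cluster, data = listings)
summary(model)
```

With three clusters, the factor contributes two dummy coefficients to the model, letting each cluster shift the price baseline independently.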
3.1 Fiscal clustering
First, we performed k-means clustering on the fiscal data set. The elbow method on the within-cluster sum of squares suggested 5 clusters.
```r
set.seed(123)

# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Check the within-cluster sum of squares for k = 1..10 (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Based on the elbow, fit the final model with the optimal k
k <- 5
kmeans_model <- kmeans(scaled_impots, centers = k)
clusters <- kmeans_model$cluster
```
Here we can see that the optimal number of clusters is either 5 or 7. We decided to stop at 5.
Next, we will interpret the clusters by looking at the cluster centers, the size of each cluster, and the distribution of the variables within each cluster.
# Get the cluster centers
cluster_centers <- kmeans_model$centers

# Create a data frame with cluster centers
cluster_centers_df <- data.frame(cluster = 1:k, cluster_centers)
# print(cluster_centers_df)

# Calculate the size of each cluster
cluster_sizes <- table(kmeans_model$cluster)
# print(cluster_sizes)

# Get the cluster labels
cluster_labels <- kmeans_model$cluster

# Convert cleaned_impots to a data frame and attach the cluster labels
impots_cluster <- as.data.frame(cleaned_impots)
impots_cluster$cluster <- cluster_labels
rownames(impots_cluster) <- rownames(impots)
impots_cluster <- impots_cluster %>% rownames_to_column(var = "Community")
And then we will plot the boxplots of the variables in each cluster to interpret the clusters.
# Subset the data to the variables used for the tax clusters, plus the cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster", "Coefficient d'impôt en %"))

# Scale the variables and convert to a data frame
scaled_tax_vars <- as.data.frame(scale(tax_vars))

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

# Create boxplots for each variable using ggplot2 with viridis colors
p <- ggplot(melted_tax, aes(x = as.factor(Tax_cluster), y = value, fill = as.factor(Tax_cluster))) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free", ncol = 2) +  # Arrange plots in 2 columns
  scale_fill_viridis_d() +                             # Use viridis color palette
  theme_minimal(base_size = 15) +                      # Increase base font size for larger plot
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(x = "", y = "", title = "Boxplots of Scaled Tax Variables by Cluster")

# Convert ggplot to an interactive plot using plotly
interactive_plot <- ggplotly(p, width = 800, height = 1000)
interactive_plot
The fiscal clusters are quite difficult to interpret. A few interesting observations we can make are the following:
Cluster 1 seems to have average-to-low taxes across the board.
Cluster 2 behaves very similarly to cluster 1, with lower state (cantonal) taxes.
Cluster 3 seems to have higher municipal taxes than clusters 1 and 2.
Cluster 4 behaves very similarly to cluster 2.
Cluster 5 has high cantonal taxes with average communal (municipal) taxes; overall, it has the highest taxes for individuals.
We are, however, aware that these interpretations, as well as those given in the following sections, fail to encompass the whole picture. Moreover, clustering on all fiscal values at once fails to capture the difference in attractiveness for individuals versus companies (taxes on income and wealth versus taxes on profits).
3.2 Demographic clustering
Then, we performed a hierarchical clustering on the demographic data. First, the data was scaled (some features are percentages, others are raw counts); the dissimilarity matrix was then computed using the Minkowski distance, and Ward's method was used for the linkage.
As the optimal number of clusters for the fiscal data set was determined to be 5, we decided to continue our analysis of the two other data sets with 5 clusters in order to keep the same scale (even though categorical) for the 3 features resulting from the unsupervised clustering.
# Clustering on the demographic data
cols_commune_demographic <- select(commune, -c(
  "REGION", "CODE_REGION",
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"
))

# Scale the columns: some are total numbers, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D2")

# Create dendrogram with 5 clusters, to keep the same scale for all our variables
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5)
par(mar = c(0.001, 4, 4, 2) + 0.1)
# plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram", xlab = "")
The unsupervised clustering performed on the demographic data of Swiss municipalities returns some interesting results.
Our first cluster seems to group municipalities where many families with children live ("Part du groupe d'âge 0-19 ans" and "Taille moyenne des ménages" are high). Moreover, we can observe very low values for "Habitants / Densité de la population" (inhabitants, population density). From this, we can infer that cluster 1 encompasses rural municipalities geared towards families.
Clusters 2 and 3 are very similar, with many variables showing no particular pattern. It is, however, worth noting that municipalities in cluster 2 have a slightly higher population density than those in cluster 3, with more foreign inhabitants. We could therefore hypothesize that cluster 2 is more urban than cluster 3.
Cluster 4 seems to correspond to municipalities in large cities (a large, dense population, with most inhabitants being 20 to 64 years old). We can also note the high share of foreign inhabitants, which could be explained by the large foreign workforce in the big Swiss cities where large corporations and NGOs operate. Moreover, the above-average share of welfare recipients (Taux d'aide sociale) further reinforces the large-city hypothesis, as wealth disparities are more prevalent there.
Cluster 5 seems to be for municipalities with an aging population (“Part du groupe d’âge 65+ ans” and “Taux de mortalité” with high values). The low values in population density further paints the picture of the small rural villages in remote areas.
3.3 Political clustering
The same process was used for our political data set, again with 5 clusters for the same reasons. The share of votes for each major party in the Conseil National elections is represented. The only difference is that we did not scale our data, as all features are percentages.
# Clustering on the political data
cols_commune_politics <- select(commune, c(
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"
))

# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D2")

# Create dendrogram with 5 clusters, to keep the same scale for all our variables
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5)
# plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")

# Subset the political variables and strip the common column prefix
political_vars <- cols_commune_politics
colnames(political_vars) <- sub("Conseil national - ", "", colnames(political_vars))

# Add political cluster labels
political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(political_vars, id.vars = "Political_cluster")

# Create boxplots for each variable using ggplot2 with viridis colors
p <- ggplot(melted_political, aes(x = as.factor(Political_cluster), y = value, fill = as.factor(Political_cluster))) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free", ncol = 2) +  # Arrange plots in 2 columns
  theme_minimal(base_size = 15) +                      # Increase base font size for larger plot
  scale_fill_viridis_d() +                             # Use viridis color palette
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(x = "Political Cluster", y = "", title = "Boxplots of Political Variables by Cluster")

# Convert ggplot to an interactive plot using plotly
interactive_plot <- ggplotly(p, width = 800, height = 1000)
interactive_plot
The political clusters are more difficult to interpret than the demographic ones. It is however interesting to note the following points:
Cluster 1 has average values for most major political parties, displaying equal strengths across the political spectrum within the municipality.
Cluster 2 has a fairly high value for UDC while the other political parties receive average votes. This paints the picture of municipalities that lean more towards the right.
Cluster 3 has fairly high values for left-leaning parties (PS, PST) and one center-right party (PLR). This seems to show the opposite behaviour to cluster 2, with a balanced view but leaning towards the left.
Cluster 4 finds its highest values in PDC and UDC. Municipalities in cluster 4 are therefore very right-leaning.
Cluster 5's most striking difference is its wide distribution for "Petits partis de droite". We could hypothesize that these municipalities are in the Italian-speaking part of Switzerland, where small right-wing parties find substantial support.
# Preparing df_commune for merging with the main data set
df_commune <- select(commune, REGION)
df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge
merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))
impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>% left_join(impots_cluster_subset, by = "Community")

clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community) %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# The only NAs are for the commune Brugg (written "Brugg (AG)" in the other data set),
# so we enter its cluster manually
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# Adding the clusters to our main data set
properties_filtered <- merge(
  properties_filtered,
  clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")],
  by = "zip_code", all.x = TRUE
)
Here, we encountered an issue when merging our clusters with the main data set: the municipality names in the cluster data sets did not always match those in the main data set exactly. Merging via zip codes also failed. Given the size of our data set (20k+ rows) and the heterogeneous distribution of the missing data, we decided to remove these 228 rows from our main data set.
# Dropping 228 rows containing NAs after the merge (problem with names)
# Find rows with NA values in the cluster columns
na_rows <- subset(properties_filtered,
                  is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
4 EDA
4.1 Map representation of distribution of properties
Here we decided to represent the distribution of properties in Switzerland using a map. The map is interactive and allows users to hover over the markers to see the price. The markers are color-coded in orange and have a semi-transparent fill to reduce visual noise. The size of the markers is smaller to optimize the visual representation of the data.
This visualization helps in understanding the geographical spread and density of properties in the dataset.
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%                                       # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%  # Add topographic maps for context
  addCircleMarkers(
    ~lon, ~lat,
    radius = 1.5,       # Smaller radius for the circle markers
    color = "#F97300",  # Specifying a color for the markers
    fillOpacity = 0.2,  # Semi-transparent fill
    stroke = FALSE,     # No border on the circle markers, to reduce visual noise
    popup = ~paste("Price: ", price, "<br>",
                   "Rooms: ", number_of_rooms, "<br>",
                   "Type: ", property_type, "<br>",
                   "Year: ", year_category),
    label = ~paste("Price: ", price)  # Tooltip on hover
  ) %>%
  addLegend(
    position = "bottomright",  # Position the legend at the bottom right
    colors = "#F97300",        # Use the same color as the markers
    labels = "Properties"      # Label for the legend
  )

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600    # Set the height of the map to 600 pixels
map
The map highlights that properties are well-distributed throughout the region, with fewer properties in the Alpine region, which is expected due to its mountainous terrain. We thus have a good representation of the data across different cantons and locations and we can use it for further analysis.
4.2 Histogram of prices
# Define breaks for x-axis in millions
breaks <- seq(0, 25000000, by = 5000000)
labels <- paste0(breaks / 1000000, "Mio")

# Calculate percentiles
percentile_95 <- quantile(properties_filtered$price, 0.95)
percentile_05 <- quantile(properties_filtered$price, 0.05)
percentile_99 <- quantile(properties_filtered$price, 0.99)

# Create the histogram
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  geom_vline(xintercept = percentile_05, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_05, y = 1700, label = "5th percentile"), vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_95, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_95, y = 1750, label = "95th percentile"), vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_99, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_99, y = 1800, label = "99th percentile"), vjust = -1, color = "blue", size = 2) +
  labs(title = "Distribution of Prices", x = "Price in CHF", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels)

# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 500)
interactive_histogram_price
As we can see, 90% of the properties are concentrated between CHF 395,590 and CHF 3.3 million. Feel free to zoom in on this part of the graph to see the majority of the properties. Only 5% of the properties are worth more than 3.3 million, and only 1% are worth more than 6.6 million.
4.3 Prices between 0 and 3.5 million
To enhance data visibility, we will focus on the majority of the data, between 0 and 3.5 million, while filtering out outliers.
4.3.1 Histogram of prices for each property type
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type", x = "Price in CHF", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100)

# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

interactive_histogram
Upon first glance, the majority of property types are valued at less than 3 million, with apartments and single houses being the most frequent.
4.3.2 Histogram of prices for each year category
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category", x = "Price in CHF", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 750, height = 1100)

# Adjust margins to prevent x-axis label from being cut off
interactive_histogram_year <- layout(interactive_histogram_year, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

interactive_histogram_year
The majority of properties were built between 2016 and 2024. Interestingly, the distribution remains similar across almost all periods.
4.3.3 Histogram of prices for each canton
Here we extend the price range slightly, to better see the differences between cantons.
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution by Canton for properties between 0 and 5 million", x = "Price in CHF", y = "Frequency") +
  theme(axis.text.y = element_text(size = 2)) +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100)

# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

interactive_histogram
Compared to other cantons, Geneva has a distinct distribution, with many properties worth more than 2 million (relative to the others). The cantons of Vaud, Valais, Ticino, Bern, and Fribourg concentrate the majority of the listed properties and share a similar distribution, with most properties worth between 0.4 and 2 million. The model needs to account for the different distributions across cantons to ensure fair comparison, avoiding bias towards larger cantons over smaller ones.
4.3.4 Histogram of prices for each number of rooms
# Define breaks for x-axis in millions
breaks <- seq(0, 3500000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

subset_properties <- properties_filtered %>% filter(number_of_rooms <= 15)

# Create the histogram
histogram <- ggplot(subset_properties, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms", x = "Price in CHF", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 3500000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1500)

# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

interactive_histogram
The majority of properties have between 2.5 and 6.5 rooms, and the distribution tends to shift slightly towards higher prices as the number of rooms increases.
4.4 Histogram of properties by square meters
To better see the data, we only show the properties with less than 1,000 square meters.
# Calculate percentiles
percentile_95 <- quantile(properties_filtered$square_meters, 0.95)
percentile_05 <- quantile(properties_filtered$square_meters, 0.05)

histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  geom_vline(xintercept = percentile_05, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_05, y = 1750, label = "5th percentile"), vjust = -1, color = "blue", size = 2) +
  geom_vline(xintercept = percentile_95, linetype = "dashed", color = "blue") +
  geom_text(aes(x = percentile_95, y = 1750, label = "95th percentile"), vjust = -1, color = "blue", size = 2) +
  labs(title = "Distribution of Properties by Square Meters", x = "Square Meters", y = "Frequency") +
  theme_minimal() +
  xlim(0, 1000)

# Convert ggplot object to plotly object with adjusted size
interactive_histogram <- ggplotly(histogram, width = 750, height = 1100)

# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50), autosize = TRUE)

interactive_histogram
No surprise here, there are more “small” properties than big ones. 90% of the properties are between 62 and 330 square meters.
4.5 Histogram of prices by Tax Cluster
# Calculate summary statistics for price by tax cluster
summary_stats <- properties_filtered %>%
  group_by(Tax_cluster) %>%
  summarise(
    avg_price = mean(price),
    Q10 = quantile(price, 0.10),
    Q90 = quantile(price, 0.90)
  )

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Tax_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Tax Cluster", x = "Tax Cluster", y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 3500000))

# Display the line plot
line_plot
Based on the results, clusters 1 and 5 are similar, with mean prices around 1 million CHF, as are clusters 3 and 4, with a mean of 1.25 million. Cluster 2 is the one that differs slightly from the others, with a mean of 1.5 million and 80% of its properties between 0.5 and 2.7 million.
4.6 Histogram of prices by Political cluster
# Calculate summary statistics for price by political cluster
summary_stats <- properties_filtered %>%
  group_by(Political_cluster) %>%
  summarise(
    avg_price = mean(price),
    Q10 = quantile(price, 0.10),
    Q90 = quantile(price, 0.90)
  )

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Political_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Political Cluster", x = "Political Cluster", y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 3500000))

# Display the line plot
line_plot
Based on this graph, political clusters 1, 2 and 4 are similar, with a mean property price of 1.1 million. Political cluster 3 has the highest mean price, at 1.75 million, and cluster 5 sits between cluster 3 and clusters 1, 2 and 4, with a mean of 1.3 million.
4.7 Histogram of prices by demographic cluster
# Calculate summary statistics for price by demographic cluster
summary_stats <- properties_filtered %>%
  group_by(Demographic_cluster) %>%
  summarise(
    avg_price = mean(price),
    Q10 = quantile(price, 0.10),
    Q90 = quantile(price, 0.90)
  )

# Plot line plot
line_plot <- ggplot(summary_stats, aes(x = Demographic_cluster)) +
  geom_line(aes(y = avg_price, color = "Mean Price")) +
  geom_line(aes(y = Q10, color = "10th Percentile")) +
  geom_line(aes(y = Q90, color = "90th Percentile")) +
  labs(title = "Average Property Prices by Demographic Cluster", x = "Demographic Cluster", y = "Price in CHF") +
  scale_color_manual(values = c("Mean Price" = "blue", "10th Percentile" = "green", "90th Percentile" = "red")) +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 4000000))

# Display the line plot
line_plot
5 Supervised learning
Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
Two or more models
Two or more scores
Tuning of one or more hyperparameters per model
Interpretation of the model(s)
5.1 Model 1
This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.
5.1.1 Tools and Software
The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:
corrplot for visualization of correlation matrices, aiding in preliminary feature selection.
car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
caret for creating training and testing sets, and managing cross-validation processes.
ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
gtsummary for creating publication-ready tables summarizing regression analysis results.
Each of these tools was chosen for its specific functionality that aids in data analysis, ensuring that each step of the model building and evaluation process is well-supported.
Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:
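The correlation step itself is not reproduced above; a minimal base-R sketch of it is shown below, with illustrative stand-in data (the actual properties data frame and its column names are assumed, not taken from the code):

```r
# Sketch of the correlation analysis, on illustrative stand-in data
set.seed(42)
n <- 200
rooms <- sample(2:8, n, replace = TRUE)
sqm   <- 30 * rooms + rnorm(n, 0, 40)                  # rooms and size correlate
price <- 9000 * sqm + 5000 * rooms + rnorm(n, 0, 2e5)  # price driven mostly by size
num_vars <- data.frame(price = price, number_of_rooms = rooms, square_meters = sqm)

# Correlation matrix on complete cases
cor_matrix <- cor(num_vars, use = "complete.obs")
print(round(cor_matrix, 2))

# Visualization step, as described in the report:
# corrplot::corrplot(cor_matrix, method = "circle")
```

The `use = "complete.obs"` argument drops rows with missing values before computing the pairwise correlations.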
We can observe a moderate correlation between the number of rooms and price (0.46) and a strong correlation between square meters and price (0.67).
The number of rooms and square meters also have a strong correlation (0.7), indicating potential multicollinearity between these predictors.
No multicollinearity issues are observed for the predictors number_of_rooms and square_meters, whose VIF values are below 5: although they are correlated, the correlation is not strong enough to cause multicollinearity issues.
High VIF Values:
canton and Tax_cluster have VIF values much greater than 10, indicating serious multicollinearity issues. These predictors are highly correlated with other predictors in the model.
Moderate VIF Values:
Other predictors like number_of_rooms, square_meters, Political_cluster, etc., have VIF values below 5, indicating acceptable multicollinearity.
We removed the canton variable from the model due to its high VIF value, which could lead to unstable coefficient estimates and unreliable model predictions.
We keep the Tax_cluster variable in the model for now, as it may provide valuable information for predicting property prices.
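The VIF computation itself is not shown above. As a base-R sketch of what `car::vif()` reports for numeric predictors (VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others; data here is illustrative):

```r
# Hand-rolled VIF for a fitted lm, matching car::vif for numeric predictors
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}

# Illustrative data with two correlated predictors
set.seed(1)
rooms <- rnorm(300, 4, 1)
sqm   <- 30 * rooms + rnorm(300, 0, 20)
price <- 9000 * sqm + rnorm(300, 0, 1e5)
fit <- lm(price ~ rooms + sqm)

vifs <- vif_manual(fit)
names(vifs) <- c("rooms", "sqm")
print(vifs)  # values above 5 (or 10) would flag multicollinearity
```

With exactly two predictors the two VIFs are equal, since each R²_j is the squared correlation between them.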
Basic Model
Model Building and Evaluation
The data set was split into training and testing sets to validate the model’s performance. The linear regression model was then fitted using selected predictors:
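A minimal sketch of this split-and-fit step follows; base R is used for the split here (the report relies on caret, e.g. `createDataPartition`), and the data frame and predictor set are illustrative stand-ins:

```r
set.seed(123)
# Illustrative stand-in for properties_filtered
n <- 500
properties_filtered <- data.frame(
  price = runif(n, 3e5, 3e6),
  square_meters = runif(n, 50, 400),
  number_of_rooms = sample(2:8, n, replace = TRUE)
)

# 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
trainData <- properties_filtered[train_idx, ]
testData  <- properties_filtered[-train_idx, ]

# Fit the linear model on the training set only
lm_model_1 <- lm(price ~ square_meters + number_of_rooms, data = trainData)

# Out-of-sample error on the held-out test set
test_rmse <- sqrt(mean((testData$price - predict(lm_model_1, newdata = testData))^2))
print(test_rmse)
```

Fitting on `trainData` only and scoring on `testData` is what makes the RMSE an honest estimate of out-of-sample accuracy.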
Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.
# Model Evaluation
## Diagnostic Checks
# plot(lm_model)
# ad.test(residuals(lm_model))

# Use gtsummary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model_1)
tbl_reg_1
| Characteristic      | Beta     | 95% CI¹            | p-value |
|---------------------|----------|--------------------|---------|
| number_of_rooms     | 3,541    | -8,741, 15,822     | 0.6     |
| square_meters       | 9,149    | 8,937, 9,360       | <0.001  |
| property_type       |          |                    |         |
| Apartment           | —        | —                  |         |
| Attic flat          | 131,957  | 52,635, 211,279    | 0.001   |
| Bifamiliar house    | -189,373 | -265,304, -113,443 | <0.001  |
| Chalet              | 138,465  | 42,911, 234,018    | 0.005   |
| Duplex              | -116,789 | -214,424, -19,154  | 0.019   |
| Farm house          | -521,808 | -742,296, -301,320 | <0.001  |
| Loft                | -83,702  | -555,442, 388,039  | 0.7     |
| Roof flat           | -94,839  | -208,057, 18,379   | 0.10    |
| Rustic house        | -79,809  | -526,117, 366,499  | 0.7     |
| Single house        | -164,801 | -209,811, -119,791 | <0.001  |
| Terrace flat        | -42,541  | -202,067, 116,986  | 0.6     |
| Villa               | 140,285  | 68,664, 211,907    | <0.001  |
| floor               | —        | —                  |         |
| eg                  | 25,247   | -20,375, 70,869    | 0.3     |
| noteg               |          |                    |         |
| year_category       |          |                    |         |
| 0-1919              | —        | —                  |         |
| 1919-1945           | 242,007  | 136,214, 347,800   | <0.001  |
| 1946-1960           | 298,608  | 198,889, 398,326   | <0.001  |
| 1961-1970           | 259,620  | 175,279, 343,961   | <0.001  |
| 1971-1980           | 307,837  | 233,097, 382,577   | <0.001  |
| 1981-1990           | 293,302  | 217,634, 368,971   | <0.001  |
| 1991-2000           | 349,181  | 270,678, 427,684   | <0.001  |
| 2001-2005           | 470,190  | 374,648, 565,732   | <0.001  |
| 2006-2010           | 566,849  | 483,302, 650,396   | <0.001  |
| 2011-2015           | 590,621  | 509,490, 671,753   | <0.001  |
| 2016-2024           | 456,872  | 392,951, 520,793   | <0.001  |
| Demographic_cluster | 77,678   | 64,328, 91,028     | <0.001  |
| Political_cluster   | -60,392  | -70,381, -50,402   | <0.001  |
| Tax_cluster         | -68,401  | -82,477, -54,324   | <0.001  |

¹ CI = Confidence Interval
Significant Predictors of Price:
Square meters: The most influential variable with a strong positive effect on price.
Property types: Some types significantly affect prices, with villas and attic flats increasing prices, while single houses and farm houses decrease prices.
Year category: Newer properties consistently have higher prices, with significant positive impacts for all categories compared to the baseline.
5.1.2.2.1 Assess Overfitting
# For the linear model
lm_train_pred <- predict(lm_model_1, newdata = trainData)
lm_test_pred <- predict(lm_model_1, newdata = testData)

# Calculate RMSE and R-squared for training data
lm_train_rmse <- sqrt(mean((trainData$price - lm_train_pred)^2))
lm_train_rsquared <- summary(lm(lm_train_pred ~ trainData$price))$r.squared

# Calculate RMSE and R-squared for test data
lm_test_rmse <- sqrt(mean((testData$price - lm_test_pred)^2))
lm_test_rsquared <- summary(lm(lm_test_pred ~ testData$price))$r.squared

# Show the results in a table
results_table <- data.frame(
  Model = c("Linear Regression"),
  RMSE_Train = lm_train_rmse,
  RMSE_Test = lm_test_rmse,
  Rsquared_Train = lm_train_rsquared,
  Rsquared_Test = lm_test_rsquared
)

# Show table in HTML
kable(results_table, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))
| Model             | RMSE_Train | RMSE_Test | Rsquared_Train | Rsquared_Test |
|-------------------|------------|-----------|----------------|---------------|
| Linear Regression | 959505     | 980302    | 0.493          | 0.48          |
No overfitting is observed as the RMSE and R-squared values are similar between the training and test sets, indicating that the model generalizes well to new data.
5.1.2.2.2 Metrics
Here are the performance metrics for the initial model:
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model_1)$r.squared
adj_r_sq <- summary(lm_model_1)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_1, newdata = testData))
mae <- mae(testData$price, predict(lm_model_1, newdata = testData))
aic <- AIC(lm_model_1)

# Show those metrics in an HTML table
metrics_1 <- kable(
  data.frame(r_sq, adj_r_sq, rmse, mae, aic),
  format = "html",
  col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")
) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics" = 5))

metrics_1
Basic Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980302 | 491715 | 500701 |
The model has moderate explanatory power, with an R-Squared of 0.493 and an Adjusted R-Squared of 0.492.
Prediction accuracy, as indicated by RMSE (980,302) and MAE (491,715), shows that the model’s predictions are reasonably close to actual prices but could be improved.
The AIC value (500,701) provides a benchmark for comparison with other models.
5.1.2.3 Hyperparameter Tuning
5.1.2.3.1 Stepwise Regression
Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model was evaluated using the same diagnostic checks and performance metrics as the initial model:
- Dropped variables: the stepwise model drops the number_of_rooms and floor variables, indicating that these predictors may not significantly influence property prices.
- Consistency in key predictors: the primary significant predictors (e.g., square_meters, certain property types, and year categories) remain consistent across both models, indicating their robust influence on property prices.
- Similar effect sizes: the effect sizes (β) and confidence intervals for significant predictors are similar in both models, reinforcing the reliability of these predictors.
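The stepwise call itself is not shown in the report. As a minimal sketch of how such a refinement can be produced with `stepAIC` from the MASS package: the data frame below is synthetic, with variable names chosen to mirror the report's (an illustrative assumption, not the authors' exact code).

```r
# Minimal sketch of stepwise selection with MASS::stepAIC.
# Synthetic data: in the report the full model would be lm(price ~ ., data = trainData).
library(MASS)

set.seed(123)
n <- 500
dat <- data.frame(
  square_meters   = runif(n, 30, 300),
  number_of_rooms = sample(1:8, n, replace = TRUE),
  floor           = sample(0:10, n, replace = TRUE)
)
# Price driven mainly by square_meters; rooms and floor contribute only noise here.
dat$price <- 8000 * dat$square_meters + rnorm(n, sd = 1e5)

full_model     <- lm(price ~ ., data = dat)
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Predictors with little explanatory power tend to be dropped,
# as number_of_rooms and floor were in the report's model.
names(coef(stepwise_model))
```

`stepAIC` adds or removes terms only when doing so lowers the AIC, so the resulting model's AIC is never worse than the full model's.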
5.1.2.3.2 Lasso and Ridge Regression
Lasso and Ridge regressions were also performed to compare the linear regression model against regularization techniques. We fit both models using cross-validation to determine the optimal lambda (penalty parameter). The plots show the lambda selection process for both the Lasso and Ridge models.
```r
# Convert data frames to matrices for glmnet
dat_tr_re_mat_x <- as.matrix(trainData[, c("number_of_rooms", "square_meters", "floor",
                                           "year_category", "Demographic_cluster",
                                           "Political_cluster", "Tax_cluster")])
dat_tr_re_mat_y <- trainData$price
dat_te_re_mat_x <- as.matrix(testData[, c("number_of_rooms", "square_meters", "floor",
                                          "year_category", "Demographic_cluster",
                                          "Political_cluster", "Tax_cluster")])
dat_te_re_mat_y <- testData$price

set.seed(123)  # for reproducibility

# Fit Lasso (alpha = 1) and Ridge (alpha = 0) with cross-validated lambda
lasso_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 1)
ridge_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 0)
#> Warning: NAs introduced by coercion
#> (this warning repeats many times; it likely stems from non-numeric columns
#> being coerced when the data frame is converted to a matrix)

# Refit each model at its optimal lambda
ridge_fit_best <- glmnet(x = dat_tr_re_mat_x, y = dat_tr_re_mat_y,
                         lambda = ridge_model$lambda.min)
lasso_fit_best <- glmnet(x = dat_tr_re_mat_x, y = dat_tr_re_mat_y,
                         lambda = lasso_model$lambda.min)  # lambda.1se is an alternative

# Lasso & Ridge performance on the training set
postResample(predict(ridge_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.89e+05 4.65e-01 5.06e+05
postResample(predict(lasso_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.78e+05 4.73e-01 4.99e+05

# Lasso & Ridge performance on the test set
postResample(predict(ridge_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 1.00e+06 4.64e-01 5.02e+05
postResample(predict(lasso_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.93e+05 4.67e-01 4.95e+05

# Stepwise lm performance on training and test sets
postResample(predict(lm_model_2, trainData), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.60e+05 4.93e-01 4.96e+05
postResample(predict(lm_model_2, testData), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.81e+05 4.80e-01 4.92e+05
```
We then compared the performance of the Lasso and Ridge models with the stepwise linear regression model using RMSE and MAE:
```r
# Helper to compute RMSE, MAE, and R-squared
get_metrics <- function(predictions, actuals) {
  RMSE     <- sqrt(mean((predictions - actuals)^2))
  MAE      <- mean(abs(predictions - actuals))
  Rsquared <- cor(predictions, actuals)^2
  return(c(RMSE = RMSE, MAE = MAE, Rsquared = Rsquared))
}

# Capture the predictions of each model
ridge_train_preds <- predict(ridge_fit_best, newx = dat_tr_re_mat_x)
lasso_train_preds <- predict(lasso_fit_best, newx = dat_tr_re_mat_x)
ridge_test_preds  <- predict(ridge_fit_best, newx = dat_te_re_mat_x)
lasso_test_preds  <- predict(lasso_fit_best, newx = dat_te_re_mat_x)
lm_train_preds    <- predict(lm_model_2, trainData)
lm_test_preds     <- predict(lm_model_2, testData)

ridge_train_metrics <- get_metrics(ridge_train_preds, dat_tr_re_mat_y)
lasso_train_metrics <- get_metrics(lasso_train_preds, dat_tr_re_mat_y)
ridge_test_metrics  <- get_metrics(ridge_test_preds, dat_te_re_mat_y)
lasso_test_metrics  <- get_metrics(lasso_test_preds, dat_te_re_mat_y)
lm_train_metrics    <- get_metrics(lm_train_preds, dat_tr_re_mat_y)
lm_test_metrics     <- get_metrics(lm_test_preds, dat_te_re_mat_y)

# Assemble the performance metrics into a data frame
performance_df <- data.frame(
  Model    = c("Ridge (Training)", "Lasso (Training)", "Ridge (Test)",
               "Lasso (Test)", "Stepwise (Training)", "Stepwise (Test)"),
  RMSE     = c(ridge_train_metrics["RMSE"], lasso_train_metrics["RMSE"],
               ridge_test_metrics["RMSE"], lasso_test_metrics["RMSE"],
               lm_train_metrics["RMSE"], lm_test_metrics["RMSE"]),
  MAE      = c(ridge_train_metrics["MAE"], lasso_train_metrics["MAE"],
               ridge_test_metrics["MAE"], lasso_test_metrics["MAE"],
               lm_train_metrics["MAE"], lm_test_metrics["MAE"]),
  Rsquared = c(ridge_train_metrics["Rsquared"], lasso_train_metrics["Rsquared"],
               ridge_test_metrics["Rsquared"], lasso_test_metrics["Rsquared"],
               lm_train_metrics["Rsquared"], lm_test_metrics["Rsquared"])
)

# Render the kableExtra table
performance_table_hyp_tune <- kable(performance_df, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Performance Metrics (RMSE, MAE, R-sq)" = 4))
performance_table_hyp_tune
```
Performance Metrics (RMSE, MAE, R-sq)

| Model | RMSE | MAE | Rsquared |
|---|---|---|---|
| Ridge (Training) | 989234 | 505694 | 0.465 |
| Lasso (Training) | 977821 | 498800 | 0.473 |
| Ridge (Test) | 1003230 | 502454 | 0.464 |
| Lasso (Test) | 993029 | 495376 | 0.467 |
| Stepwise (Training) | 959549 | 496015 | 0.493 |
| Stepwise (Test) | 980670 | 491579 | 0.480 |
The Stepwise model is the preferred choice for predicting property prices based on these metrics, offering the best balance of accuracy and predictive power. Lasso regression is a good alternative, particularly if model simplicity and interpretability are priorities, since it shrinks coefficients and eliminates irrelevant features. Ridge regression, while slightly less accurate, provides stable performance and handles multicollinearity well, which is useful for our Tax_cluster variable with its high VIF.
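The high VIF mentioned for Tax_cluster can be verified without any extra package: a predictor's variance inflation factor is 1/(1 − R²), where R² comes from regressing that predictor on all the others. The sketch below uses synthetic data, and `vif_manual` is our own illustrative helper, not a function from the report.

```r
# VIF from first principles: regress each predictor on the rest and
# take 1 / (1 - R^2). Values well above ~5-10 flag multicollinearity.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(200)                  # independent of the others

preds <- data.frame(x1, x2, x3)
vifs  <- vif_manual(preds)
round(vifs, 1)  # x1 and x2 get large VIFs; x3 stays near 1
```

The same computation applied to the report's predictors would reproduce the high VIF flagged for Tax_cluster.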
5.1.2.3.3 Metrics
Here we compare the performance metrics of the initial model and the stepwise model. The metrics of our initial model:
```r
metrics_1
```

Basic Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980302 | 491715 | 500701 |
Stepwise model:
```r
# Print R-squared, adjusted R-squared, RMSE, MAE, and AIC
r_sq     <- summary(lm_model_2)$r.squared
adj_r_sq <- summary(lm_model_2)$adj.r.squared
rmse     <- rmse(testData$price, predict(lm_model_2, newdata = testData))
mae      <- mae(testData$price, predict(lm_model_2, newdata = testData))
aic      <- AIC(lm_model_2)

# Show those metrics in an HTML table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                   col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5))
metrics_2
```
Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980670 | 491579 | 500699 |
The Stepwise model offers a slight improvement over the Basic model in terms of prediction accuracy (lower MAE) and model efficiency (lower AIC). Both models perform similarly in explaining the variance in property prices, with nearly identical R-Squared and Adjusted R-Squared values. Given the minimal differences, the choice between models may depend on the preference for model simplicity (Stepwise model) versus a more comprehensive approach (Basic model).
5.1.2.4 Cross-Validation
Cross-validation was used to assess the model's robustness, to guard against overfitting, and to ensure that the model generalizes well to new data; the caret package was used to manage this process efficiently. The CV results show metrics such as RMSE and R-squared across folds, which indicate the stability of the model's performance across different subsets of the data.
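As an illustration of what caret's `trainControl(method = "cv", number = 10)` automates, a 10-fold cross-validation can be hand-rolled in a few lines of base R. The data and model below are synthetic stand-ins, not the report's trainData.

```r
# Hand-rolled 10-fold cross-validation on synthetic data.
set.seed(123)
n <- 1000
dat <- data.frame(square_meters = runif(n, 30, 300))
dat$price <- 8000 * dat$square_meters + rnorm(n, sd = 1e5)

# Randomly assign each row to one of 10 folds
folds <- sample(rep(1:10, length.out = n))

# For each fold: fit on the other 9 folds, score on the held-out fold
cv_rmse <- sapply(1:10, function(k) {
  fit  <- lm(price ~ square_meters, data = dat[folds != k, ])
  pred <- predict(fit, newdata = dat[folds == k, ])
  sqrt(mean((dat$price[folds == k] - pred)^2))
})

mean(cv_rmse)  # average hold-out RMSE across the 10 folds
sd(cv_rmse)    # fold-to-fold variability (cf. RMSESD in the caret output below)
```

Averaging the hold-out RMSE across folds is what makes the estimate robust to any single lucky or unlucky train/test split.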
Here are the performance metrics for the stepwise model:
```r
metrics_2
```

Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980670 | 491579 | 500699 |
- Better prediction accuracy: the 10-fold cross-validation shows a lower RMSE compared to both the Basic and Stepwise models.
- Slightly lower explanatory power: the R-squared is marginally lower in the cross-validated model.
We can thus say that the model generalizes well to new data, with improved prediction accuracy compared to the initial models.
5.1.2.5 Model testing
We chose the stepwise model as the best linear regression model due to its balance of accuracy and simplicity.
The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:
5.1.2.5.1 Residual vs Predicted Prices
This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.
```r
# Model testing: predict on the unseen test set
predicted_prices <- predict(lm_model_2, newdata = testData)
testData$predicted_prices <- predicted_prices  # add to testData to ensure alignment

# Calculate residuals manually
testData$test_residuals <- testData$price - predicted_prices

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices",
       x = "Predicted Prices", y = "Residuals")

# Convert ggplot to an interactive plotly figure
p <- ggplotly(gg, width = 600, height = 400)
p
```
The residual plot for all cantons shows a pronounced trend of increasing residuals with higher predicted prices, along with more extreme outliers. This indicates potential model misfit or heteroscedasticity when considering all cantons.
5.1.2.5.2 Actual vs Predicted Prices
Ideally, the points in this plot should lie along the diagonal, where actual prices equal predicted prices. The clustering of data along the line suggests a generally good model fit, but the spread, especially at higher price points, indicates variance in the predictions.
```r
# Visualization: actual vs predicted prices
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices",
       x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to an interactive plotly figure
p <- ggplotly(gg, width = 600, height = 400)
p
```
The actual vs. predicted prices plot shows a similar trend, with a wider spread of points at higher price points, indicating that the model may not perform as well in predicting prices for more expensive properties.
5.1.3 Linear Regression for Specific Canton
To address this variance at higher price points, we can filter the data to a more specific set of cantons, namely Valais, Ticino, Vaud, Bern, and Fribourg, to see whether the model performs better within this subset.
Indeed, as seen in the EDA section, these cantons have more properties and show similar price distributions.
```r
# Keep only properties in the cantons Valais, Ticino, Vaud, Bern, and Fribourg
properties_filtered_2 <- properties_filtered %>%
  filter(canton %in% c("Valais", "Ticino", "Vaud", "Bern", "Fribourg"))

# Show each unique value in the canton column
unique(properties_filtered_2$canton)
#> [1] "Vaud"     "Fribourg" "Bern"     "Valais"   "Ticino"
```
5.1.3.1 Model Building and Evaluation
We then repeat the model building and evaluation process on this filtered dataset to compare its performance with the initial (best) model:
```r
# For the linear model
lm_train_pred <- predict(lm_model_1.1, newdata = trainData)
lm_test_pred  <- predict(lm_model_1.1, newdata = testData)

# RMSE and R-squared on the training data
lm_train_rmse     <- sqrt(mean((trainData$price - lm_train_pred)^2))
lm_train_rsquared <- summary(lm(lm_train_pred ~ trainData$price))$r.squared

# RMSE and R-squared on the test data
lm_test_rmse     <- sqrt(mean((testData$price - lm_test_pred)^2))
lm_test_rsquared <- summary(lm(lm_test_pred ~ testData$price))$r.squared

# Collect the results in a table
results_table <- data.frame(
  Model          = "Linear Regression",
  RMSE_Train     = lm_train_rmse,
  RMSE_Test      = lm_test_rmse,
  Rsquared_Train = lm_train_rsquared,
  Rsquared_Test  = lm_test_rsquared
)

# Render the table as HTML
kable(results_table, format = "html") %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
```
| Model | RMSE_Train | RMSE_Test | Rsquared_Train | Rsquared_Test |
|---|---|---|---|---|
| Linear Regression | 968636 | 959284 | 0.51 | 0.556 |
No overfitting is observed as the RMSE and R-squared values are similar between the training and test sets, indicating that the model generalizes well to new data.
5.1.3.2.1 Metrics
Here are the performance metrics for the model restricted to the selected cantons:
```r
# Print R-squared, adjusted R-squared, RMSE, MAE, and AIC
r_sq     <- summary(lm_model_1.1)$r.squared
adj_r_sq <- summary(lm_model_1.1)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_1.1))
#> Warning in actual - predicted: longer object length is not a
#> multiple of shorter object length
# NB: predict() is called here without newdata, so training-set predictions
# are recycled against the test prices; this RMSE should be read with
# caution (cf. the warning above).
mae <- mae(testData$price, predict(lm_model_1.1, newdata = testData))
aic <- AIC(lm_model_1.1)

# Show those metrics in an HTML table
metrics_1.1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                     col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (Selected Canton)" = 5))
metrics_1.1
```
Basic Model Performance Metrics (Selected Canton)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.51 | 0.509 | 1734053 | 497329 | 313870 |
Here are the previous metrics of our stepwise model (without the canton filter):
```r
metrics_2
```

Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980670 | 491579 | 500699 |
We observe a very slight increase in R-squared and a drop in AIC, suggesting that the model may perform better within this canton subset; note, however, that the much larger RMSE reported above stems from calling predict() without newdata (see the warning) and is therefore not directly comparable with the earlier test-set RMSE.
We also observe that number_of_rooms is no longer dropped, as it was in the previous stepwise model, indicating that this variable may have a more significant impact on property prices within this canton subset; floor, however, is still dropped.
5.1.3.3.1 Metrics
Here are the performance metrics for the stepwise model on the selected cantons, along with comparisons to the earlier models:
```r
# Performance metrics
r_sq     <- summary(lm_model_2.1)$r.squared
adj_r_sq <- summary(lm_model_2.1)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_2.1))
#> Warning in actual - predicted: longer object length is not a
#> multiple of shorter object length
# NB: same caveat as above - predict() without newdata recycles
# training-set predictions against test prices.
mae <- mae(testData$price, predict(lm_model_2.1, newdata = testData))
aic <- AIC(lm_model_2.1)

# Show those metrics in an HTML table
metrics_2.1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                     col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (Selected Canton)" = 5))
metrics_2.1
```
Stepwise Model Performance Metrics (Selected Canton)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.51 | 0.509 | 1733970 | 497318 | 313868 |
Here are the previous metrics of our Basic model for the selected cantons (without stepwise selection):
```r
metrics_1.1
```

Basic Model Performance Metrics (Selected Canton)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.51 | 0.509 | 1734053 | 497329 | 313870 |
Here are the previous metrics of our stepwise model (without selecting cantons):
```r
metrics_2
```

Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980670 | 491579 | 500699 |
- R-Squared and Adjusted R-Squared: both the Basic and Stepwise models for the selected cantons have a higher R-Squared (0.51) and Adjusted R-Squared (0.509) than the all-canton Stepwise model (0.493 and 0.492).
- RMSE and MAE: the RMSE of the all-canton Stepwise model (980,670) is significantly lower than that of the selected-canton Basic model (1,734,053) and Stepwise model (1,733,970), indicating better predictive accuracy. The MAE is also slightly lower for the all-canton Stepwise model (491,579) than for the selected-canton Basic (497,329) and Stepwise (497,318) models.
- AIC: the AIC of the all-canton Stepwise model (500,699) is much higher than that of the selected-canton Basic (313,870) and Stepwise (313,868) models, suggesting the all-canton model may be less preferable when considering model complexity and goodness of fit; note, though, that AIC values are not directly comparable across models fit to datasets of different sizes.
And here are the 10-fold cross-validation metrics of our first model (without selecting cantons):
```r
metrics_cv_1
```

10 Fold Cross Validation Metrics

| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 955116 | 0.496 | 497284 | 125443 | 0.054 | 24808 |
The 10-fold cross-validation for the selected cantons shows a slightly higher R-squared (0.51) than the general model (0.496), indicating better explanatory power, and a lower RMSE standard deviation (66,342 vs. 125,443), suggesting more consistent performance across the folds.
5.1.3.5 Model testing
5.1.3.5.1 Residual vs Predicted Prices
```r
# Model testing: predict on the unseen test set
predicted_prices <- predict(lm_model_2.1, newdata = testData)
testData$predicted_prices <- predicted_prices  # add to testData to ensure alignment

# Calculate residuals manually
testData$test_residuals <- testData$price - predicted_prices

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices",
       x = "Predicted Prices", y = "Residuals")

# Convert ggplot to an interactive plotly figure
p <- ggplotly(gg, width = 600, height = 400)
p
```
The residual plot for the selected cantons displays a tighter clustering of residuals around the zero line and fewer extreme outliers, suggesting more consistent and reliable model performance than the all-canton model.
5.1.3.5.2 Actual vs Predicted Prices
```r
# Visualization: actual vs predicted prices
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices",
       x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to an interactive plotly figure
p <- ggplotly(gg, width = 600, height = 400)
p
```
The points are even more concentrated than for the all-canton model, indicating that the model predicts prices better within this canton subset. We still see a spread of points at higher price points, indicating variance in the predictions, which is confirmed by the metrics above, which change little.
5.1.4 Conclusion
For our first model, we used linear regression to predict property prices from various features, evaluating it with performance metrics such as R-squared, RMSE, and MAE to assess its predictive accuracy and explanatory power. The model does not perform well, even when the number of cantons taken into account is reduced.
5.2 Model 2
5.2.1 Random Forest
We decided to implement a Random Forest model to compare its performance with the linear regression model. Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes or the mean prediction of the individual trees. This model is known for its robustness and ability to handle complex relationships in the data.
5.2.1.1 Model Building and Evaluation
5.2.1.1.1 Data Splitting and Fitting
We split the data into training and test sets, fit the Random Forest model, and then evaluated it using performance metrics such as R-squared, RMSE, and MAE to assess its predictive accuracy and explanatory power.
```r
# Split the data into training and test sets (0.8 / 0.2)
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered_2$price, p = 0.8, list = FALSE)
trainData  <- properties_filtered_2[trainIndex, ]
testData   <- properties_filtered_2[-trainIndex, ]

# Fit the Random Forest as a regression model
rf_model <- randomForest(price ~ ., data = trainData, ntree = 500, importance = TRUE)
rf_model.pred_rf <- predict(rf_model, newdata = testData)

rmse_rf <- sqrt(mean((testData$price - rf_model.pred_rf)^2))
mae_rf  <- mean(abs(testData$price - rf_model.pred_rf))
r_sq_rf <- cor(testData$price, rf_model.pred_rf)^2

# Show those metrics in an HTML table
metrics_rf <- kable(data.frame(r_sq_rf, rmse_rf, mae_rf), format = "html",
                    col.names = c("R-Squared", "RMSE", "MAE")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Random Forest Model Performance Metrics" = 3))
metrics_rf
```
Random Forest Model Performance Metrics

| R-Squared | RMSE | MAE |
|---|---|---|
| 0.758 | 720509 | 326754 |
5.2.1.1.2 Evaluation Metrics and Comparison
Comparing with the best linear regression model, the Random Forest model has a higher R-squared and lower RMSE and MAE, indicating better predictive accuracy.
```r
metrics_2
```

Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.493 | 0.492 | 980670 | 491579 | 500699 |
The plot shows the actual vs. predicted prices, with the diagonal line indicating perfect predictions.
```r
# Note: with price ~ rf_model.pred_rf, the predictions are on the x-axis
plot(testData$price ~ rf_model.pred_rf, col = 'skyblue',
     xlab = 'Predicted Price', ylab = 'Actual Price',
     main = 'Actual vs Predicted Price')
abline(0, 1, col = 'darkred')
```
5.2.1.2 Variable Importance
5.2.1.2.1 Importance of Predictors
Variable importance (VI) plots are a useful tool to understand the relative importance of predictors in the Random Forest model. This plot shows the importance of each predictor, helping to identify the key features that drive price predictions.
```r
varImpPlot(rf_model)
```
We see that square_meters is the most important predictor.
5.2.1.2.2 Interpretation of Key Features
```r
# Get the importance of each feature
importance(rf_model)
#>                     %IncMSE IncNodePurity
#> zip_code              29.41      1.20e+15
#> number_of_rooms       30.29      2.47e+15
#> square_meters         62.88      7.84e+15
#> address               22.49      8.65e+14
#> canton                 9.39      1.98e+14
#> property_type         24.24      8.19e+14
#> floor                 17.70      4.84e+14
#> year_category         45.38      1.68e+15
#> Community             30.84      7.05e+14
#> Canton_code            8.10      1.73e+14
#> lon                   28.67      9.25e+14
#> lat                   19.54      1.27e+15
#> Demographic_cluster   29.95      3.56e+14
#> Political_cluster     18.49      1.24e+14
#> Tax_cluster            4.37      5.40e+13
```
5.2.1.3 Cross-Validation
[TODO: remove (too long to run) or try another approach]
```r
# cv_results_rf <- train(price ~ ., data = trainData, method = "rf",
#                        trControl = trainControl(method = "cv", number = 5))
# summary(cv_results_rf)
#
# # Show the CV results in the HTML output
# metrics_cv_rf <- kable(cv_results_rf$results, format = "html") %>%
#   kable_styling(position = "center",
#                 bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
#   add_header_above(c("10 Fold Cross Validation Metrics (Random Forest)" = 7))
# metrics_cv_rf
```
[TODO: evaluate model stability]
5.2.1.4 Hyperparameter Tuning
5.2.1.4.1 Tuning Hyperparameters (mtry, ntree)
[TODO: remove (too long to run) or try another approach]
```r
# # Define the tuning grid
# tuneGrid <- expand.grid(mtry = seq(2, sqrt(ncol(trainData)), by = 1))  # range of mtry values
#
# # Train the model with tuning
# rf_tuned <- train(price ~ ., data = trainData, method = "rf",
#                   trControl = trainControl(method = "cv", number = 5, search = "grid"),
#                   tuneGrid = tuneGrid,
#                   ntree = 1000)
#
# # Plot the tuning effect
# plot(rf_tuned)
#
# # Evaluate the tuned model
# rf_model_pred <- predict(rf_tuned, newdata = testData)
# rmse_rf <- sqrt(mean((testData$price - rf_model_pred)^2))
# mae_rf  <- mean(abs(testData$price - rf_model_pred))
# r_sq_rf <- cor(testData$price, rf_model_pred)^2
#
# # Show metrics
# metrics_rf <- kable(data.frame(R_Squared = r_sq_rf, RMSE = rmse_rf, MAE = mae_rf),
#                     format = "html", col.names = c("R-Squared", "RMSE", "MAE")) %>%
#   kable_styling(position = "center",
#                 bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
#   add_header_above(c("Tuned Random Forest Model Performance Metrics" = 3))
# metrics_rf
```
5.2.1.4.2 Grid Search Methodology
```r
# Grid search methodology
```
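Since the tuning code above is commented out, here is a lightweight base-R sketch of the grid-search idea itself: score each candidate hyperparameter value by k-fold cross-validation and keep the best. Polynomial degree on synthetic data stands in for `mtry`/`ntree` here (an illustrative substitution; the principle is identical).

```r
# Generic grid search: for each candidate value, compute the cross-validated
# error and pick the value that minimizes it.
set.seed(123)
n <- 300
x <- runif(n, -2, 2)
y <- x^3 - x + rnorm(n, sd = 0.5)   # true relationship is cubic
dat <- data.frame(x = x, y = y)

grid  <- 1:6                               # candidate polynomial degrees
folds <- sample(rep(1:5, length.out = n))  # 5-fold assignment

cv_score <- sapply(grid, function(deg) {
  mean(sapply(1:5, function(k) {
    fit  <- lm(y ~ poly(x, deg), data = dat[folds != k, ])
    pred <- predict(fit, newdata = dat[folds == k, ])
    mean((dat$y[folds == k] - pred)^2)     # hold-out MSE for this fold
  }))
})

best <- grid[which.min(cv_score)]
best  # degree with the lowest cross-validated MSE
```

caret's `train(..., tuneGrid = ...)` performs exactly this loop, with the grid rows as candidates and the resampling scheme from `trainControl` as the scorer.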
5.2.1.4.3 Evaluating Tuned Model Performance
6 Conclusion
- Brief summary of the project
- Take-home message
- Limitations
- Future work?
6.1 Results
TODOs:
- Present all results: include both positive and negative results, without further data or method descriptions.
- Use subsections: structure the section with subsections for clarity, especially if there are many results.
- Highlight key figures: show the most interesting and relevant figures in the main results section.
- Use the appendix for additional figures: place additional figures (e.g., extensive time series) in the appendix or supplementary material and reference them appropriately.
- Include technical interpretations: provide limited technical interpretations of the results, saving broader discussions for the next section.
7 Recommendations and Discussion
- Summary of what was observed from the results.
- Implications for the business (link with the context and the initial objectives of the study).
- Limitations (this may be the subject of a separate section). An opening is often expected here: "What next?", "What can be done for improvement?", etc.